1 Goal

Today I chose a dataset from Kaggle with air quality data for Delhi. I want to try to fit a normal distribution to some of the data.

import pandas as pd
import numpy as np

df = pd.read_csv('data/day25/delhi_air_quality.csv')

df.head(5)

	Date	Month	Year	Holidays_Count	Days	PM2.5	PM10	NO2	SO2	CO	Ozone	AQI
0	1	1	2021	0	5	408.80	442.42	160.61	12.95	2.77	43.19	462
1	2	1	2021	0	6	404.04	561.95	52.85	5.18	2.60	16.43	482
2	3	1	2021	1	7	225.07	239.04	170.95	10.93	1.40	44.29	263
3	4	1	2021	0	1	89.55	132.08	153.98	10.42	1.01	49.19	207
4	5	1	2021	0	2	54.06	55.54	122.66	9.70	0.64	48.88	149

# Get an understanding of the data
df.describe()

	Date	Month	Year	Holidays_Count	Days	PM2.5	PM10	NO2	SO2	CO	Ozone	AQI
count	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000	1461.000000
mean	15.729637	6.522930	2022.501027	0.189596	4.000684	90.774538	218.219261	37.184921	20.104921	1.025832	36.338871	202.210815
std	8.803105	3.449884	1.118723	0.392116	2.001883	71.650579	129.297734	35.225327	16.543659	0.608305	18.951204	107.801076
min	1.000000	1.000000	2021.000000	0.000000	1.000000	0.050000	9.690000	2.160000	1.210000	0.270000	2.700000	19.000000
25%	8.000000	4.000000	2022.000000	0.000000	2.000000	41.280000	115.110000	17.280000	7.710000	0.610000	24.100000	108.000000
50%	16.000000	7.000000	2023.000000	0.000000	4.000000	72.060000	199.800000	30.490000	15.430000	0.850000	32.470000	189.000000
75%	23.000000	10.000000	2024.000000	0.000000	6.000000	118.500000	297.750000	45.010000	26.620000	1.240000	45.730000	284.000000
max	31.000000	12.000000	2024.000000	1.000000	7.000000	1000.000000	1000.000000	433.980000	113.400000	4.700000	115.870000	500.000000

Thus we see that there are four years of data available (2021 to 2024), with one recording every day: 1461 rows is exactly 3 × 365 + 366. It would now be interesting to plot the PM2.5 column.
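The row count can be checked against the calendar with a few lines of standard-library Python. This is only a sketch, and it assumes the data runs from 2021 through 2024 inclusive, as the min and max of the Year column suggest.

```python
import calendar

# Assumption: the data spans 2021 through 2024 inclusive (min and max of Year).
years = range(2021, 2025)

# Count the calendar days in each year, accounting for leap years.
expected_rows = sum(366 if calendar.isleap(year) else 365 for year in years)

print(expected_rows)  # 1461, the same as the count reported by describe()
```

Since 2024 is the only leap year in that range, one row per day gives exactly the 1461 rows seen above.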

df.columns

Index(['Date', 'Month', 'Year', 'Holidays_Count', 'Days', 'PM2.5', 'PM10',
       'NO2', 'SO2', 'CO', 'Ozone', 'AQI'],
      dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            1461 non-null   int64  
 1   Month           1461 non-null   int64  
 2   Year            1461 non-null   int64  
 3   Holidays_Count  1461 non-null   int64  
 4   Days            1461 non-null   int64  
 5   PM2.5           1461 non-null   float64
 6   PM10            1461 non-null   float64
 7   NO2             1461 non-null   float64
 8   SO2             1461 non-null   float64
 9   CO              1461 non-null   float64
 10  Ozone           1461 non-null   float64
 11  AQI             1461 non-null   int64  
dtypes: float64(6), int64(6)
memory usage: 137.1 KB

import altair as alt

alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2.5'
)

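The PM2.5 column can't be plotted as-is. A likely cause is the period in the column name, which Altair treats as nested-field access, so I rename the column.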

df = df.rename(columns={'PM2.5': 'PM2_5'})

# Trying again with the new column name
alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2_5'
)

That did the trick. We can clearly see that PM2.5 levels are generally lowest from July to September, while December and January are the worst. There is, however, an outlier in June with a PM2.5 of 1000. Since PM10 also has a maximum of exactly 1000 in the summary above, the instrument may not be able to read above that threshold.
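As an alternative to renaming the column, the period in the field name can probably be escaped with a backslash so that Altair reads it literally. The helper below is hypothetical (the name `escape_field` is my own) and only covers periods and brackets, which I assume are the characters that need escaping here.

```python
def escape_field(name: str) -> str:
    """Hypothetical helper: backslash-escape characters that Altair
    treats specially in field names (assumed here: period and brackets)."""
    for char in ('.', '[', ']'):
        name = name.replace(char, '\\' + char)
    return name


print(escape_field('PM2.5'))  # PM2\.5
```

With the original column name kept, the encoding would then be written as `y=escape_field('PM2.5')`. Renaming, as done above, is the simpler option when the column is used in many places.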

2 Calculating the normal distribution of PM2.5 for 2024

import math
import matplotlib.pyplot as plt

df_2024 = df[df['Year'] == 2024]

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma))

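# Remove extreme readings (e.g. a reading of 1000) by keeping only rows below the 99th percentile.
# This is done before computing the mean and standard deviation, so the extremes don't skew them.
df_2024 = df_2024[df_2024['PM2_5'] < df_2024['PM2_5'].quantile(0.99)]

# Storing the mean value of PM2.5 in 2024
mu = df_2024['PM2_5'].mean()

# Storing the standard deviation of PM2.5 in 2024
sigma = df_2024['PM2_5'].std()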

# Creating an array of evenly spaced values (step 1) from the minimum PM2.5 up to the maximum.
# The PM2_5 column can't be used as-is, because it doesn't contain every value between its min and max.
xs = np.arange(df_2024['PM2_5'].min(), df_2024['PM2_5'].max())

# Storing the density (y value) for each x
y = [normal_pdf(x, mu=mu, sigma=sigma) for x in xs]

# plotting distribution
plt.plot(xs, y)
plt.title("Normal distribution of PM2.5 in Delhi 2024")
plt.show()
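A quick sanity check of `normal_pdf` is to compare it with `statistics.NormalDist` from the standard library, and to confirm that the area under the curve is close to 1. This is a minimal sketch: the `mu` and `sigma` values are illustrative placeholders, not the fitted 2024 values.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)


# Illustrative parameters only, not the fitted 2024 values.
mu, sigma = 90.0, 70.0

# 1) Compare with the standard library at a few points.
for x in (0.0, 50.0, 90.0, 200.0):
    assert math.isclose(normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))

# 2) A Riemann sum over mu +/- 8 sigma should come out close to 1.
step = 0.1
n_steps = int(16 * sigma / step)
area = sum(
    normal_pdf(mu - 8 * sigma + i * step, mu, sigma) * step
    for i in range(n_steps)
)
print(round(area, 4))
```

If both checks pass, the hand-written density agrees with the library implementation and integrates to roughly 1, as a density should.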

3 Reflections

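We thus have a probability density function for PM2.5 in 2024. Since PM2.5 is a continuous quantity, the curve shows the relative likelihood of each value, while an actual probability comes from the area under the curve over a range of values.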
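To get an actual probability out of a fitted normal distribution, the density has to be integrated over a range, which the cumulative distribution function does directly. The sketch below uses `statistics.NormalDist`. The helper `prob_between` is my own name, and the parameters are the all-years mean and standard deviation from the `df.describe()` output, not the trimmed 2024 values used for the plot.

```python
from statistics import NormalDist


def prob_between(low, high, mu, sigma):
    """P(low <= X <= high) for X ~ Normal(mu, sigma), computed via the CDF."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(high) - dist.cdf(low)


# Illustrative parameters: all-years mean and std from df.describe(),
# not the trimmed 2024 values.
mu, sigma = 90.774538, 71.650579

p = prob_between(50, 100, mu, sigma)
print(f"P(50 <= PM2.5 <= 100) = {p:.3f}")
```

The result is only as good as the assumption that PM2.5 is normally distributed, which this post does not test.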
Besides calculating the normal distribution, it could be interesting to use linear regression to approximate the PM2.5 on any given day.
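As a rough idea of what that could look like, here is a minimal least-squares line fit in plain Python. It uses only the five rows shown in the `df.head(5)` output (day of month against PM2.5 for early January 2021), so it illustrates the mechanics and is not a meaningful model. The function name `fit_line` is my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five rows from df.head(5): day of month and PM2.5.
days = [1, 2, 3, 4, 5]
pm2_5 = [408.80, 404.04, 225.07, 89.55, 54.06]

slope, intercept = fit_line(days, pm2_5)
print(f"PM2.5 = {slope:.2f} * day + {intercept:.2f}")
```

A real attempt would fit on the full dataset and probably use more features than the day alone, for example the month, given the seasonal pattern in the scatter plot.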